-
Notifications
You must be signed in to change notification settings - Fork 584
add clusteroperator Upgradeable conditions #206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| // OperatorUpgradeable indicates whether or not the operator is in a state where it can accept a new payload. When | ||
| // set to `False` the CVO will disallow moving to a new version. | ||
| OperatorUpgradeable ClusterStatusConditionType = "Upgradeable" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we've already established that it's not necessary an "upgrade", and also because i still think there might be some value in using this to determine general readiness of an operator, i'd propose renaming it to Ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we've already established that it's not necessary an "upgrade", and also because i still think there might be some value in using this to determine general readiness of an operator, i'd propose renaming it to
Ready.
You cannot have the CVO base its decision about whether or not to allow a version change upon the operator status for the same reason you cannot allow it make a decision based on operand status. If you did, you would be unable to fix a broken operator with a version change. You're describing a different condition.
The upgradeable condition will block an upgrade and it should only be set in a case when we cannot safely allow a version change to proceed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You cannot have the CVO base its decision about whether or not to allow a version change upon the operator status for the same reason you cannot allow it make a decision based on operand status.
who's setting this "upgradeable" condition if not the operator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You cannot have the CVO base its decision about whether or not to allow a version change upon the operator status for the same reason you cannot allow it make a decision based on operand status.
who's setting this "upgradeable" condition if not the operator?
Assuming you intend for Ready to mean your operator is functional. If an operator is Ready==false, then we should definitely allow upgrading that operator because otherwise it can be stuck broken. If an operator is Ready=true, then we should definitely allow upgrading that operator because it is functioning correctly. It's a condition unrelated to whether or not the CVO should allow an upgrade.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How/when do you define that operator is not ready (Ready=false) but upgradable (Upgradable=true)? What's the algorithm/conditions behind that decision? Is it human made?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How/when do you define that operator is not ready (
Ready=false) but upgradable (Upgradable=true)? What's the algorithm/conditions behind that decision? Is it human made?
I think we're looking at different conditions and I'd like to have this pull focus on Upgradeable. It is an operator made choice.
|
this is a little meta. i know we discussed this previously, but when would
an operator disallow moving to a new version? do you have a scenario in
mind where you would set this False with your current operator?
…On Mon, Feb 11, 2019 at 2:19 PM Ben Parees ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In config/v1/types_cluster_operator.go
<#206 (comment)>:
> @@ -124,6 +124,10 @@ const (
// The binary maintained by the operator (eg: openshift-apiserver for the openshift-apiserver-operator) may still be
// available, but the user intent cannot be fulfilled.
OperatorFailing ClusterStatusConditionType = "Failing"
+
+ // OperatorUpgradeable indicates whether or not the operator is in a state where it can accept a new payload. When
+ // set to `False` the CVO will disallow moving to a new version.
+ OperatorUpgradeable ClusterStatusConditionType = "Upgradeable"
You cannot have the CVO base its decision about whether or not to allow a
version change upon the operator status for the same reason you cannot
allow it make a decision based on operand status.
who's setting this "upgradeable" condition if not the operator?
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#206 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AF8dbJYuzdrBwC2SOl-hchaX0xfjfA_eks5vMcIxgaJpZM4a0nQL>
.
|
yes. When the |
soltysh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is reasonable here imo, the discussions currently are around when and how which is fine but should not block the PR itself.
lgtm
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deads2k, soltysh The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
config/v1/types_cluster_operator.go
Outdated
| // available, but the user intent cannot be fulfilled. | ||
| OperatorFailing ClusterStatusConditionType = "Failing" | ||
|
|
||
| // OperatorUpgradeable indicates whether or not the operator is in a state where it can accept a new payload. When |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't use payload, that word is being scrubbed, and don't use CVO. Use this description instead:
Upgradeable indicates whether the operator is in a state that is safe to upgrade. When status is
Falseadministrators should not upgrade their cluster and the message field should contain a human readable description of what the administrator should do to allow the operator to successfully update.
|
Change the godoc as per my comment. |
|
The current use cases described for this are:
There are a few future use cases we need to describe:
@abhinavdahiya and I had a long chat, I think what we allow right now is:
Once we have that much implemented, we'll go back and assess. Future CVO might have "precondition jobs" that we use to assess whether a version could be started (without actually updating anything). This would capture most of the "quick and dirty checks", assuming we don't have many of them. In the short term, the CVO will do the following
Operators should be Available=True if they satisfy the "reason they are part of the payload" - i.e. if they expose an API, if the API is deployed they are available true. If they have a single operand that should always be running, they are Available=true. If they have an optional (as defined by biz use case) operand, they are available "if they have either created the operand or are waiting on a human". If they have many operands, they should become available when "the default operands as defined by a human" are created. |
So applying this statement to the two devex use cases:
For both of these, we will want to continue to have actually useful conditions on the operator-specific resource that indicate the state of the operand and why. Unfortunately this means there's no consistent way to check that all operators and their operands are healthy/stable. Or at least that the ClusterOperator resource is not that mechanism. I think this largely goes against the current ClusterOperator api documentation(so that should be fixed) and from the perspective of an operator author trying to report useful status about the state of my operand, I think it's a step backwards, but if we are saying the CVO isn't going to switch to a different mechanism for determining when an operator is sufficiently stable such that the install/upgrade can continue, I guess that's what we have. /cc @openshift/sig-developer-experience |
Why? If you don't have an installed registry your health is "good".
Why? The specific condition of you don't have a pull secret to the registry is something the operator should report but requires human intervention (today). When we ship, it will again be an error. |
And when i do have an installed registry but the storage isn't configured/available, but the operator still reports available=true/progressing=false so that it doesn't block the install? I wouldn't consider that a registry that's in good health.
Because importing the samples takes time(thus making the install take longer) and having the samples available isn't critical to having a working cluster. We're being asked why we're delaying the install. |
There is no difference between a registry without configured storage and NetworkOperator Type=None - if you are waiting for human, you are available. |
That seems like a slippery slope. When no registry pods can be scheduled because of node issues that require a human to go add more capacity, am I available/healthy? When the storage is configured to a PVC but the autoprovisioner didn't generate a PV that's claimable, am I healthy because the admin just needs to go create a PV for me? If anything it seems particularly weird to report available=true when you know the admin needs to take some action to actually make the thing available. From an admin/end-user perspective, it doesn't matter why the registry isn't running, I need to know that it's not(and then i will care why and see if i need to do something about it). Clearly the clusteroperator "available=true" condition isn't the thing that tells the admin that. |
No, you aren't.
No you aren't. Those both seemed pretty easy? |
I don't think they look that different to an admin. The difference between "you didn't configure storage at all" and "you configured storage but it's not available, maybe you did it wrong" is pretty subtle, yet you're saying in one case the registry is "available" and in the other it's not. Not configuring storage is basically a special case of "you configured storage that isn't usable". And your criteria above was "if you're waiting for a human you're available". Both of the scenarios I presented are cases where you're also waiting for a human. So I don't see it as an easy distinction, but more importantly I don't see it as a useful distinction for an administrator who just cares if the thing is actually push/pullable or not (or upgradeable or not). Which gets to the larger point. There are two concerns:
|
|
It looks like most of the action in this pull is about the meaning of Available, not the meaning of Upgradeable. @smarterclayton doc updated |
|
doc may need clarification to indicate that only |
|
New changes are detected. LGTM label has been removed. |
|
I'm late to the game, but I'll add a little more detail to what makes this confusing. I'm working on the
It seems very confusing that I should actually be reporting |
|
Gonna add one more. If I set my operator |
|
However this: https://github.com/openshift/library-go/blob/master/pkg/operator/status/status_controller.go#L116 // if we are unmanaged, force everything to Unknown.
if detailedSpec.ManagementState == operatorv1.Unmanaged {
clusterOperatorObj.Status = configv1.ClusterOperatorStatus{}
configv1helpers.SetStatusCondition(&clusterOperatorObj.Status.Conditions, configv1.ClusterOperatorStatusCondition{Type: configv1.OperatorAvailable, Status: configv1.ConditionUnknown, Reason: "Unmanaged"})
configv1helpers.SetStatusCondition(&clusterOperatorObj.Status.Conditions, configv1.ClusterOperatorStatusCondition{Type: configv1.OperatorProgressing, Status: configv1.ConditionUnknown, Reason: "Unmanaged"})
configv1helpers.SetStatusCondition(&clusterOperatorObj.Status.Conditions, configv1.ClusterOperatorStatusCondition{Type: configv1.OperatorFailing, Status: configv1.ConditionUnknown, Reason: "Unmanaged"})
configv1helpers.SetStatusCondition(&clusterOperatorObj.Status.Conditions, configv1.ClusterOperatorStatusCondition{Type: configv1.OperatorUpgradeable, Status: configv1.ConditionUnknown, Reason: "Unmanaged"})
}makes me question what I previously stated. |
…onal update risk We've had Upgradeable since 2019 [1], but it is a confusing condition, because the "between minor versions" wording [2] in the message isn't obvious to users who have not yet internalized SemVer's MAJOR.MINOR.PATCH terminology [3]. Conditional update risks landed in 2021 [4] and give us a clear API for declaring exactly which update targets have the exposure. This commit adds client-side code to the tech-preview 'recommend' subcommand, so folks can get a feel for that user experience. If it goes over well, we can shift the logic to the cluster-version operator so all clients can benefit. The alreadyHasUpgradeableRisk check ensures we don't double up if the CVO eventually picks up this pivot. [1]: openshift/api#206 [2]: https://github.com/openshift/cluster-version-operator/blob/c4b8362d8acd08d63a600b4d53c33e8737ed7a53/pkg/cvo/upgradeable.go#L218-L228 [3]: https://semver.org/spec/v2.0.0.html#summary [4]: openshift/api#1011
…onal update risk We've had Upgradeable since 2019 [1], but it is a confusing condition, because the "between minor versions" wording [2] in the message isn't obvious to users who have not yet internalized SemVer's MAJOR.MINOR.PATCH terminology [3]. Conditional update risks landed in 2021 [4] and give us a clear API for declaring exactly which update targets have the exposure. This commit adds client-side code to the tech-preview 'recommend' subcommand, so folks can get a feel for that user experience. If it goes over well, we can shift the logic to the cluster-version operator so all clients can benefit. The alreadyHasUpgradeableRisk check ensures we don't double up if the CVO eventually picks up this pivot. [1]: openshift/api#206 [2]: https://github.com/openshift/cluster-version-operator/blob/c4b8362d8acd08d63a600b4d53c33e8737ed7a53/pkg/cvo/upgradeable.go#L218-L228 [3]: https://semver.org/spec/v2.0.0.html#summary [4]: openshift/api#1011
We cannot gate upgrading on available, failing, and progressing because if any of those conditions are "wrong", we may need to change the version in order to get them working. This introduces a new condition that indicates whether or not the operator can accept a payload change.
@bparees
/assign @smarterclayton